# COMPSCI 389: Introduction to Machine Learning
# Model Evaluation

In this notebook we will consider ways of evaluating how effective supervised learning algorithms are.

Let's start with the imports that we will use in this notebook:

In [1]:
import pandas as pd
from sklearn.neighbors import KDTree
from sklearn.base import BaseEstimator
import numpy as np

# New this time:
from sklearn.model_selection import train_test_split    # For splitting into training and testing sets (more on this below!)

Next, let's load and display the GPA data set:

In [2]:
# Load the data set
df = pd.read_csv("https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas
#df = pd.read_csv("data/GPA.csv", delimiter=',')

# Display the data set
display(df)

# Split into X (inputs) and y (labels)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

Unnamed: 0,physics,biology,history,English,geography,literature,Portuguese,math,chemistry,gpa
0,622.60,491.56,439.93,707.64,663.65,557.09,711.37,731.31,509.80,1.33333
1,538.00,490.58,406.59,529.05,532.28,447.23,527.58,379.14,488.64,2.98333
2,455.18,440.00,570.86,417.54,453.53,425.87,475.63,476.11,407.15,1.97333
3,756.91,679.62,531.28,583.63,534.42,521.40,592.41,783.76,588.26,2.53333
4,584.54,649.84,637.43,609.06,670.46,515.38,572.52,581.25,529.04,1.58667
...,...,...,...,...,...,...,...,...,...,...
43298,519.55,622.20,660.90,543.48,643.05,579.90,584.80,581.25,573.92,2.76333
43299,816.39,851.95,732.39,621.63,810.68,666.79,705.22,781.01,831.76,3.81667
43300,798.75,817.58,731.98,648.42,751.30,648.67,662.05,773.15,835.25,3.75000
43301,527.66,443.82,545.88,624.18,420.25,676.80,583.41,395.46,509.80,2.50000


Recall our nearest neighbor implementation:

In [3]:
class NearestNeighbor(BaseEstimator):
    def fit(self, X, y):
        # Convert X and y to NumPy arrays if they are DataFrames. 
        # This makes fit compatible with numpy arrays or DataFrames
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Store the training data and labels.
        self.X_data = X
        self.y_data = y
        
        # Create a KDTree for efficient nearest neighbor search
        self.tree = KDTree(X)

        return self

    def predict(self, X):
        # Convert X to a NumPy array if it's a DataFrame
        if isinstance(X, pd.DataFrame):
            X = X.values

        # Query the tree for the nearest neighbors of all points in X.
        # ind will be a 2D array where ind[i,j] is the index of the 
        # j'th nearest point to the i'th row in X.
        dist, ind = self.tree.query(X, k=1)

        # Extract the nearest labels.
        # ind[:,0] are the indices of the nearest neighbors to each 
        # query (each row in X)
        return self.y_data[ind[:,0]]        

## Evaluation

Now that we have created our first ML algorithm, how can we determine how effective it is?

> **Idea**: Run the model on many data points and compute the average error.

Let's do this:

In [4]:
# Train the model on the data
model = NearestNeighbor()
model.fit(X, y)
predictions = model.predict(X)

# Compute the average error
average_error = (predictions - y).mean()

print("Average Error:", average_error)

Average Error: 0.0


### The Illusion of Perfect Predictions

We've seemingly achieved perfect predictions with our model! But let's pause and reflect.

**Question**: Are our predictions genuinely perfect?

**Answer**: Not quite. There's a fundamental problem with our approach: we evaluated our model's performance using the **same data** we used to train it.

#### Why This Evaluation Is Misleading

Evaluating a model on the training data answers the question:

> How well does our model predict outcomes for data it has already seen?

But the real question we want to answer is:

> How well can our model predict outcomes for new, unseen data?

This is not only a problem for the NN algorithm (although it is particularly clear in this case). This problem arises when you evaluate *any* ML algorithm using the same data (or some of the same data) that was used to train it.
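To see why this is so clear for nearest neighbor, here is a minimal sketch on synthetic data (the arrays and the helper `predict_1nn` are hypothetical, not part of the notebook). The labels are pure noise, so nothing can truly be learned, yet every point the model has already seen is predicted exactly, because each such point is its own nearest neighbor.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)

# Synthetic data: the labels are random noise, unrelated to the inputs.
X_toy = rng.normal(size=(200, 3))
y_toy = rng.normal(size=200)

# Build the tree from the first 100 points only; hold out the rest.
X_seen, y_seen = X_toy[:100], y_toy[:100]
X_unseen, y_unseen = X_toy[100:], y_toy[100:]
tree = KDTree(X_seen)

def predict_1nn(X_query):
    # Return the label of the single nearest stored point for each query row.
    _, ind = tree.query(X_query, k=1)
    return y_seen[ind[:, 0]]

# Fraction of predictions that equal the true label exactly.
match_seen = np.mean(predict_1nn(X_seen) == y_seen)
match_unseen = np.mean(predict_1nn(X_unseen) == y_unseen)
print("Exact matches on seen data:  ", match_seen)
print("Exact matches on unseen data:", match_unseen)
```

The perfect score on the seen points says nothing about the unseen ones, which is the situation the rest of this section addresses.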

#### Train/Test Splits

To accurately assess a model's performance, we need to test it on data that it hasn't seen during training. This is where **train/test splits** come into play.

- **Training Set**: A subset of the data used to train the model. The model learns to make predictions based on this data.
- **Testing Set**: A separate subset used to evaluate the model. This set is not used during training and thus provides an unbiased evaluation of the model's performance on new data.

By splitting our data into these two sets, we can train our model on one portion and then test its predictions on another, unseen portion. This approach gives us a more realistic measure of how well our model will perform in real-world scenarios, where it encounters data it hasn't seen before.

This raises the question: If we have `data_size` points (rows), how many should we use for training and how many for testing?

- If we use too much for training, too few points are left for testing, so our evaluation will have high variance (it will not be reliable).
- If we use too little for training, the models we learn will not perform well.
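The first point can be illustrated with a rough simulation. This is only a sketch under stated assumptions: `pool` is a hypothetical collection of per-point error magnitudes for some fixed model (synthetic values, not the GPA data), and `spread_of_estimate` is an illustrative helper. It repeatedly draws a testing set of a given size and measures how much the reported average varies from draw to draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-point error magnitudes for a fixed model (synthetic).
pool = rng.exponential(scale=1.0, size=100_000)

def spread_of_estimate(test_size, repeats=500):
    # Draw `repeats` random testing sets (with replacement, for simplicity)
    # and record the average error that each one reports.
    estimates = [pool[rng.integers(0, pool.size, size=test_size)].mean()
                 for _ in range(repeats)]
    # The standard deviation of these averages is the spread of the evaluation.
    return np.std(estimates)

for test_size in (10, 100, 1000):
    print("test size", test_size, "-> spread", spread_of_estimate(test_size))
```

Smaller testing sets give noticeably more variable estimates, which is why we cannot give all of the data to training.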

Although there is some research studying how to split data into training and testing sets, the *vast* majority of the time people pick a split like 50/50, 60/40, 40/60, 80/20, 20/80, etc. based on their intuition about how much data their algorithm needs to produce a good model and how much data will be needed for evaluation. Let's use 60% of our data for training and 40% for testing.

**Question**: If we take the first X% for training and the last (100-X)% for testing, what's something we should watch out for in real applications?

**Answer**: Sometimes data sets are provided in some sort of order. For example, the student data could be sorted by GPA. We don't want to put all of the high-GPA points into training and the low-GPA points into testing, since that would also bias our evaluation. We therefore randomly select which points go into the training and testing sets. We will use the `train_test_split` function from scikit-learn which does this for us (when `shuffle=True`).
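As a small illustration of the ordering problem, the sketch below uses a toy data set whose labels are sorted from low to high (the names `X_sorted` and `y_sorted` and the values are made up for this example). It compares `train_test_split` with `shuffle=False` and `shuffle=True`; `random_state=0` is only there to make the shuffled split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels sorted from low to high, standing in for data sorted by GPA.
y_sorted = pd.Series(np.linspace(0.0, 4.0, 100), name="gpa")
X_sorted = pd.DataFrame({"feature": np.arange(100)})

# Without shuffling: the first 60% go to training, the last 40% to testing.
_, _, y_tr_ordered, y_te_ordered = train_test_split(
    X_sorted, y_sorted, test_size=0.4, shuffle=False)

# With shuffling: rows are assigned to the two sets at random.
_, _, y_tr_shuffled, y_te_shuffled = train_test_split(
    X_sorted, y_sorted, test_size=0.4, shuffle=True, random_state=0)

print("No shuffle: train mean", y_tr_ordered.mean(), "test mean", y_te_ordered.mean())
print("Shuffled:   train mean", y_tr_shuffled.mean(), "test mean", y_te_shuffled.mean())
```

Without shuffling, every testing label is larger than every training label, so the testing set looks nothing like the data the model was trained on.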

In [5]:
# We already loaded X and y, but do it again as a reminder
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split the data into training and testing sets (60% train, 40% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=True)

# Display the training data.
display(X_train)

# Train the model on the training data
model.fit(X_train, y_train)

# Predict on the testing data
predictions = model.predict(X_test)

# Compute the average error on the testing data
average_error = (predictions - y_test).mean()

print("Average Error:", average_error)

Unnamed: 0,physics,biology,history,English,geography,literature,Portuguese,math,chemistry
28005,405.95,441.07,543.93,642.68,450.98,525.70,561.74,429.42,447.41
17302,509.22,514.19,476.64,518.26,470.93,560.50,449.40,393.80,410.14
27044,505.69,586.47,676.95,740.67,557.43,573.57,684.98,626.14,558.67
31774,432.28,621.59,515.54,469.30,537.18,549.01,426.03,520.14,566.83
1483,670.08,515.43,598.85,686.78,636.61,652.86,508.80,635.35,819.01
...,...,...,...,...,...,...,...,...,...
18630,447.26,419.96,514.69,360.20,576.42,435.27,458.60,422.37,437.57
23873,679.87,713.81,679.53,708.97,678.52,739.79,724.57,653.90,767.87
10573,584.54,677.47,613.97,543.48,560.80,579.90,510.76,489.41,484.16
10922,607.08,611.81,491.59,577.75,557.43,644.87,516.95,465.13,558.67


Average Error: 0.005091800554208521


Notice that the training data is a new DataFrame that maintains the indices from the original DataFrame. This makes it easier to look up corresponding values (e.g., in the label Series). Try setting `shuffle=False` in the previous Python cell, and notice how it changes the indices. Turn shuffling back on (and re-run the cell) before continuing.

## Evaluation Metrics

Wow, these predictions are *really* good! Given the 9 entrance exam scores we can predict a new applicant's GPA to within a couple *thousandths* of a GPA point!

Let's look at some of these super-accurate predictions:

In [6]:
# The predictions are a numpy array. Convert them to a Series
predictions_series = pd.Series(predictions, name='prediction')

# Calculate the difference
difference = predictions_series - y_test

# Create a new DataFrame
temp = pd.DataFrame({
    'label': y_test,
    'prediction': predictions_series,
    'difference': difference
})

print(temp)

         label  prediction  difference
0          NaN     3.01000         NaN
1          NaN     3.64333         NaN
2          NaN     3.86000         NaN
3          NaN     2.83333         NaN
4          NaN     2.51667         NaN
...        ...         ...         ...
43289  3.75667         NaN         NaN
43291  3.08667         NaN         NaN
43297  3.63333         NaN         NaN
43298  2.76333         NaN         NaN
43300  3.75000         NaN         NaN

[27734 rows x 3 columns]


Wait, why are we getting NaN? Notice that this DataFrame has 27,734 rows, with index values running up to 43,300, which is close to the total number of data points. It should only have as many rows as the testing set, which has just 17,322!

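What's happening is that `train_test_split` preserves the original indexing when producing `y_test`. So, although `y_test.size == 17322`, the index values of `y_test` are drawn from the full range 0 to 43,301, while `predictions_series` uses the default index 0, 1, 2, .... pandas lines up the two Series by index label rather than by position, so any index that appears in only one of them gets NaN for the other. Preserving the original indices is useful in cases where you want to match up labels in `y_test` to their corresponding rows in the original data set.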
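The same behavior can be reproduced on a tiny example. This is a minimal sketch with made-up values: `labels` plays the role of `y_test` (non-default index) and `preds` plays the role of `predictions_series` (default index).

```python
import pandas as pd

# A Series with a non-default index, like y_test after train_test_split.
labels = pd.Series([2.0, 3.0, 4.0], index=[7, 9, 5])

# A Series built without an explicit index gets the default index 0, 1, 2.
preds = pd.Series([2.5, 3.5, 4.5])

# Subtraction aligns on index labels, not positions. No labels are shared,
# so the result has one row per distinct index (six rows), all NaN.
misaligned = preds - labels
print(misaligned)

# Once the label index is reset, the rows line up by position.
aligned = preds - labels.reset_index(drop=True)
print(aligned)
```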

We can use `reset_index(drop=True)` [[link]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to reset the indices in `y_test` to the default indexing of 0, 1, 2, ...

In [7]:
# The predictions are a numpy array. Convert them to a Series
predictions_series = pd.Series(predictions, name='prediction')
y_test_series = pd.Series(y_test, name='label').reset_index(drop=True) # We reset the indices in y_test. drop=True means to discard the old indices. If False, it keeps the old index as a new column rather than discarding it.

# Calculate the difference
difference = predictions_series - y_test_series

# Create a new DataFrame
temp = pd.DataFrame({
    'label': y_test_series,
    'prediction': predictions_series,
    'difference': difference
})

print(temp)

         label  prediction  difference
0      2.38000    3.010000    0.630000
1      2.94333    3.643330    0.700000
2      3.58333    3.860000    0.276670
3      3.79000    2.833330   -0.956670
4      2.44333    2.516670    0.073340
...        ...         ...         ...
17317  2.87000    3.500000    0.630000
17318  3.92667    3.593330   -0.333340
17319  2.03667    0.473333   -1.563337
17320  3.62333    0.266667   -3.356663
17321  3.35667    2.583330   -0.773340

[17322 rows x 3 columns]


Something's wrong here! These aren't that accurate. Almost all are off by way more than a few thousandths of a GPA point. 

Before going on, note that we could obtain these values (in a different order) with the following. Here `predictions` is a numpy array, while `y_test` is a Series, so the result is a Series. The Series includes index information. Notice that the indices are not in order. The previous discussion should make it clear why this is.

In [8]:
print(predictions - y_test)

23161    0.630000
2278     0.700000
24792    0.276670
30070   -0.956670
35701    0.073340
           ...   
19475    0.630000
10036   -0.333340
7515    -1.563337
28792   -3.356663
12557   -0.773340
Name: gpa, Length: 17322, dtype: float64


**Question**: These errors seem bigger than expected based on our evaluation. What are we doing wrong?

**Answer**: We are computing the mean error (or average error), which lets positive and negative errors cancel out! This measures whether we are on average over-predicting or under-predicting. Since the errors are computed as prediction minus label and their mean is slightly positive, we are (on average) over-predicting by a slight amount.
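A toy example (with made-up numbers) makes the cancellation concrete: every prediction below is off by a full GPA point, yet the mean error is zero.

```python
import numpy as np

labels = np.array([2.0, 3.0, 2.0, 3.0])
preds = np.array([3.0, 2.0, 3.0, 2.0])   # each prediction is off by 1.0

errors = preds - labels                   # [ 1., -1.,  1., -1.]
mean_error = errors.mean()                # positive and negative errors cancel
mean_size = np.abs(errors).mean()         # average size of the errors

print("Mean error:", mean_error)               # 0.0
print("Average size of errors:", mean_size)    # 1.0
```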

There are several alternative metrics that can better quantify the accuracy of a model for a regression problem. We review four of the most common:

#### Mean Squared Error (MSE)

MSE measures the average of the squares of the errors. It gives a higher weight to larger errors, making it sensitive to outliers. It's useful when large errors are particularly undesirable.

$$\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2,$$

where $n$ is the size of the testing set, $y_i$ is the $i^\text{th}$ label, and $\hat y_i$ is the $i^\text{th}$ prediction.

#### Root Mean Squared Error (RMSE)

RMSE is the square root of MSE. It has the same units as the target variable (the same scale), making it easier to interpret. Like MSE, it gives more weight to larger errors.

$$\operatorname{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2}.$$

#### Mean Absolute Error (MAE)

MAE measures the average magnitude of the errors in a set of predictions, without considering their sign. It's less sensitive to outliers compared to MSE and RMSE (this can be a good thing or a bad thing, depending on your application).

$$\operatorname{MAE}=\frac{1}{n}\sum_{i=1}^n \left \vert y_i - \hat y_i \right \vert.$$
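The difference in outlier sensitivity can be seen in a small sketch with made-up errors. The helper `mae_and_rmse` is hypothetical and simply applies the two formulas above to an array of errors $y_i - \hat y_i$.

```python
import numpy as np

def mae_and_rmse(errors):
    # Apply the MAE and RMSE formulas to an array of errors.
    return np.mean(np.abs(errors)), np.sqrt(np.mean(errors ** 2))

clean = np.full(10, 0.1)       # ten errors of size 0.1
with_outlier = clean.copy()
with_outlier[-1] = 3.0         # replace one of them with a large error

mae_clean, rmse_clean = mae_and_rmse(clean)
mae_out, rmse_out = mae_and_rmse(with_outlier)

print("No outlier:  MAE", mae_clean, "RMSE", rmse_clean)
print("One outlier: MAE", mae_out, "RMSE", rmse_out)
```

The single large error raises the MAE to roughly 0.39 but raises the RMSE to roughly 0.95, so the RMSE reacts far more strongly to it.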

#### R-squared ($R^2$)

R-squared, or the *coefficient of determination*, indicates the proportion of the variance of the dependent variable (the labels) that is predictable from the independent variables (the inputs). Unlike the other metrics, a higher $R^2$ indicates a better fit.

$$R^2=1-\frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2},$$

where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$ is the average label. In this equation the numerator measures the variation left unexplained by the model (the sum of squared residuals) and the denominator measures the total variation in the actual labels.


Let's create functions for computing these different metrics given an array or Series of predictions and labels.

In [9]:
def mean_squared_error(predictions, labels):
    return np.mean((predictions - labels) ** 2)

def root_mean_squared_error(predictions, labels):
    return np.sqrt(mean_squared_error(predictions, labels))

def mean_absolute_error(predictions, labels):
    return np.mean(np.abs(predictions - labels))

def r_squared(predictions, labels):
    ss_res = np.sum((labels - predictions) ** 2)        # ss_res is the "Sum of Squares of Residuals"
    ss_tot = np.sum((labels - np.mean(labels)) ** 2)    # ss_tot is the "Total Sum of Squares"
    return 1 - (ss_res / ss_tot)
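As a sanity check, the same formulas can be compared against the implementations in `sklearn.metrics` on a small made-up example. This is only a sketch: the formulas are written inline so that the block runs on its own, and the arrays are toy values. Note that the scikit-learn functions take the labels first (`y_true, y_pred`), the opposite order from the functions above.

```python
import numpy as np
from sklearn import metrics as skm

labels = np.array([3.0, 2.0, 4.0, 1.0])
preds = np.array([2.5, 2.0, 3.0, 2.0])

# Same formulas as the functions above, written inline.
mse = np.mean((preds - labels) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(preds - labels))
r2 = 1 - np.sum((labels - preds) ** 2) / np.sum((labels - labels.mean()) ** 2)

# scikit-learn expects (y_true, y_pred), i.e. labels first.
print("MSE: ", mse, skm.mean_squared_error(labels, preds))
print("RMSE:", rmse, np.sqrt(skm.mean_squared_error(labels, preds)))
print("MAE: ", mae, skm.mean_absolute_error(labels, preds))
print("R^2: ", r2, skm.r2_score(labels, preds))
```

The argument order does not matter for MSE, RMSE, or MAE, but it does for $R^2$, since only the labels appear in the denominator.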

Let's use these functions to test how well our NN algorithm works on the GPA data set.

In [10]:
# Compute the average error and other metrics on the testing data
average_error = (predictions - y_test).mean()
mse = mean_squared_error(predictions, y_test)
rmse = root_mean_squared_error(predictions, y_test)
mae = mean_absolute_error(predictions, y_test)
r2 = r_squared(predictions, y_test)

# Print the metrics
print("Average Error:", average_error)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("R-squared:", r2)

Average Error: 0.005091800554208521
Mean Squared Error: 1.1341230826549182
Root Mean Squared Error: 1.0649521504062605
Mean Absolute Error: 0.8192533543932571
R-squared: -0.6876057456137379


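These give a much clearer picture of how accurate the model is. Some are easier to interpret than others, but all can be used to compare the performance of different ML methods. For example, the MAE says that predictions are off by about 0.82 GPA points on average, and the negative $R^2$ says that the model's squared error on the testing set is larger than that of always predicting the average testing label.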
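One way to read an $R^2$ value is to compare it with the simplest possible baseline. The sketch below uses made-up labels and repeats the `r_squared` definition so that it runs on its own: always predicting the average label gives $R^2 = 0$, and predictions that are worse than that baseline give a negative value.

```python
import numpy as np

def r_squared(predictions, labels):
    ss_res = np.sum((labels - predictions) ** 2)      # sum of squared residuals
    ss_tot = np.sum((labels - np.mean(labels)) ** 2)  # total sum of squares
    return 1 - (ss_res / ss_tot)

labels = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: predict the average label for every point.
mean_baseline = np.full_like(labels, labels.mean())

# Deliberately bad predictions: the labels in reverse order.
reversed_preds = labels[::-1]

r2_baseline = r_squared(mean_baseline, labels)
r2_reversed = r_squared(reversed_preds, labels)
print("R^2 of mean baseline:       ", r2_baseline)   # 0.0
print("R^2 of reversed predictions:", r2_reversed)   # -3.0
```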